在文本中提取时间关系是自然语言理解的一个至关重要但充满挑战的问题。根据事件之间的距离,模型必须学会从事件对周围的本地和全局环境中进行不同的信息以进行时间关系预测。学习如何融合这些信息已证明对基于变压器的语言模型具有挑战性。因此,我们介绍了mulco:多尺度对比的共同训练,这是一种更好地融合本地和全球情境化特征的技术。我们的模型使用基于BERT的语言模型编码本地上下文和图形神经网络(GNN)来表示全局文档级句法和时间特征。与以前的最先进方法不同,该方法在多视图功能上使用简单的串联或使用复杂的强化学习方法选择最佳句子,我们的模型Co-Trains GNN和BERT模块使用多规模的对比度学习目标。 GNN和BERT模块通过将GNN多层多跳子图(即,全局上下文嵌入)和BERT输出(即局部上下文嵌入)进行对比,从而学习了协同参数化。我们从经验上证明,与当前的最新技术相比,Mulco提供了改进的使用Bert和GNN编码的本地和全球环境的能力。我们的实验结果表明,Mulco在几个时间关系提取数据集上实现了新的最新结果。
translated by 谷歌翻译
医疗成果的预测模型对提高临床决策具有很强的希望。这些型号培训培训,诸如临床笔记的富患者数据,将许多患者信号汇总到结果预测中。然而,基于AI的临床模型通常是从​​初始循证药物(EBM)的突出范式的孤立的临床模型,其中医学决策是基于来自现有文献的明确证据。在这项工作中,我们介绍了帮助桥接ebm和基于AI的临床模型之间的这种差距的技术,并表明这些方法可以提高预测准确性。我们提出了一种新颖的系统,可根据重症监护(ICU)患者信息自动检索患者特异性文献,汇总相关文件并将其融合在内的内部录音,以形成结果预测。与强大的最近基线相比,我们的模型能够在三个具有挑战性的任务上提高预测准确性;对于住院医生的死亡率,我们能够通过超过25%的大幅度提高10%的精度。
translated by 谷歌翻译
自然语言理解(NLU)通过大型基准驱动的大规模进展,与转让学习的研究配对扩大其影响。基准是由一小部分频繁现象的主导,留下了一条长长的不常见现象。在这项工作中,我们反映了问题:转移学习方法足够地解决了长尾的基准训练模型的表现吗?由于基准未列出包括/排除的现象,我们使用宏观级别的宏观尺寸(如经验丰富的类型,主题等)概念化。我们评估通过100个代表性论文转让学习的定性荟萃分析来转移学习研究的趋势nlu。我们的分析问了三个问题:(i)哪个长尾尺寸进行转移学习研究目标? (ii)哪种特性有助于适应方法改善长尾的性能? (iii)哪种方法差距对长尾性能有最大的负面影响?我们对这些问题的答案突出了在长尾的转让学习中的未来研究的主要途径。最后,我们展示了一个案例研究,比较了各种适应方法对临床叙事的性能,以表明系统性开展的元实验如何提供能够沿着这些未来的途径取得进展的见解。
translated by 谷歌翻译
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.
translated by 谷歌翻译
Language models are widely deployed to provide automatic text completion services in user products. However, recent research has revealed that language models (especially large ones) bear considerable risk of memorizing private training data, which is then vulnerable to leakage and extraction by adversaries. In this study, we test the efficacy of a range of privacy-preserving techniques to mitigate unintended memorization of sensitive user text, while varying other factors such as model size and adversarial conditions. We test both "heuristic" mitigations (those without formal privacy guarantees) and Differentially Private training, which provides provable levels of privacy at the cost of some model performance. Our experiments show that (with the exception of L2 regularization), heuristic mitigations are largely ineffective in preventing memorization in our test suite, possibly because they make too strong of assumptions about the characteristics that define "sensitive" or "private" text. In contrast, Differential Privacy reliably prevents memorization in our experiments, despite its computational and model-performance costs.
translated by 谷歌翻译
Finding an initial noise vector that produces an input image when fed into the diffusion process (known as inversion) is an important problem in denoising diffusion models (DDMs), with applications for real image editing. The state-of-the-art approach for real image editing with inversion uses denoising diffusion implicit models (DDIMs) to deterministically noise the image to the intermediate state along the path that the denoising would follow given the original conditioning. However, DDIM inversion for real images is unstable as it relies on local linearization assumptions, which result in the propagation of errors, leading to incorrect image reconstruction and loss of content. To alleviate these problems, we propose Exact Diffusion Inversion via Coupled Transformations (EDICT), an inversion method that draws inspiration from affine coupling layers. EDICT enables mathematically exact inversion of real and model-generated images by maintaining two coupled noise vectors which are used to invert each other in an alternating fashion. Using Stable Diffusion, a state-of-the-art latent diffusion model, we demonstrate that EDICT successfully reconstructs real images with high fidelity. On complex image datasets like MS-COCO, EDICT reconstruction significantly outperforms DDIM, improving the mean square error of reconstruction by a factor of two. Using noise vectors inverted from real images, EDICT enables a wide range of image edits--from local and global semantic edits to image stylization--while maintaining fidelity to the original image structure. EDICT requires no model training/finetuning, prompt tuning, or extra data and can be combined with any pretrained DDM. Code is available at https://github.com/salesforce/EDICT.
translated by 谷歌翻译
We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. Better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that with appropriate partitioning, the lower memory requirements of multiquery attention (i.e. multiple query heads share single key/value head) enables scaling up to 32x larger context lengths. Finally, we achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens, while supporting a long 2048-token context length on the PaLM 540B parameter model.
translated by 谷歌翻译
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PALM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
translated by 谷歌翻译
预先训练的大语言模型(LLM)(例如OpenAI Codex)通过从非正式自然语言(NL)意图中生成自然代码来自动化编码的重要方面。但是,生成的代码无法满足用户意图的任何正确性保证。实际上,很难定义正确性的概念,因为自然语言可能是模棱两可的,并且缺乏正式的语义。在本文中,我们通过提出测试驱动的用户形式化(TDUIF)的工作流程来解决以上问题的第一步,该工作流利用轻量级用户的反馈共同将用户的意图正式化为测试(部分规范) ),(b)生成符合正式用户意图的代码。要对算法进行可扩展的大规模自动化评估,而无需循环中的用户,我们描述了如何使用参考解决方案模拟用户与高保真性的互动。我们还描述并实施了几种算法组件(包括突变和排名一组测试)的替代实现,这些实现可用于有效解决TDUIF问题。我们已经开发了一个系统的Ticoder,该系统实现了多种解决方案来进行TDUIF,并将其对MBPP学术代码生成基准测试的相对有效性进行了比较。在MBPP上使用OpenAI Codex LLM的结果很有希望:我们的最佳算法将通行证@1代码生成准确度指标从48.39%提高到单个用户查询,最高为85.48%,最多可达55.48%,最多可提供5个用户查询。其次,我们可以生成与用户意图在1.69个用户查询中的非平凡功能单位测试,该数据集为90.40%的示例,用于此数据集。
translated by 谷歌翻译
由摩尔定律驱动的计算系统性能的改善已改变了社会。由于这种硬件驱动的收益放缓,对于软件开发人员而言,专注于开发过程中的性能和效率变得更加重要。尽管几项研究表明了这种提高的代码效率的潜力(例如,与硬件相比,2倍更好的世代改进),但在实践中解锁这些收益是充满挑战的。关于算法复杂性以及硬件编码模式的相互作用的推理对于普通程序员来说可能是具有挑战性的,尤其是当与围绕开发速度和多人发展的务实约束结合使用时。本文旨在解决这个问题。我们分析了Google Code JAM竞争中的大型竞争编程数据集,并发现有效的代码确实很少见,中位数和第90%的解决方案之间的运行时间差异为2倍。我们建议使用机器学习以提示的形式自动提供规范反馈,以指导程序员编写高性能代码。为了自动从数据集中学习这些提示,我们提出了一种新颖的离散变异自动编码器,其中每个离散的潜在变量代表了不同的代码编辑类别,从而提高了性能。我们表明,此方法代表代码效率的多模式空间比序列到序列基线更好地编辑,并生成更有效的解决方案的分布。
translated by 谷歌翻译